Skip to content

Conversation

@smalyshev
Copy link
Contributor

@smalyshev smalyshev commented Nov 6, 2024

Catch the error coming from remote cluster and mark it as PARTIAL. Looks like we need to do it in two places, since otherwise acquireAvoid will take over and cancel the whole task, which we don't want to happen.

In runtime, the following failures can happen, which need to be covered:

  • Can not send message to remote (disconnect)
  • Error when establishing exchange
  • Error during computation
  • Disconnect during computation

See also: #112886

@smalyshev smalyshev force-pushed the skip-on-fail branch 2 times, most recently from 72a0519 to a9d0b09 Compare November 7, 2024 20:26
@smalyshev smalyshev changed the title Ignore failures on skip_unavailable Ignore remote ES|QL execution failures when skip_unavailable=true Nov 12, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @smalyshev, I've created a changelog YAML for you.

Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass review with some questions and comments.


/**
* Marks the cluster as PARTIAL and adds the exception to the cluster's failures record.
* Currently, additional failures are not recorded, TODO: check if this should be the case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK to add more failures to the cluster metadata list. In search when shard level searches occur, a cluster might have multiple failures listed in the array, so feel free to do that here if there is a use case for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'll do it later maybe, for now I think I want to deal with a single failure first without the complication of handling more than one.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

import static org.elasticsearch.xpack.esql.session.EsqlSessionCCSUtils.markClusterEmptyInfo;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking about the EsqlSessionCCSUtils is that it would be specific to plan-time operations and not used at execution time (that's why I renamed it from Costin's CcsUtils to EsqlSessionCCSUtils). So I would want us to think about if there are enough "helper" methods needed at execution time to create a "ComputeCCSUtils"? If not, then we should rename EsqlSessionCCSUtils to maybe just "EsqlCCSUtils".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now there's one method only - one that creates that "empty" cluster state (which is yes, not empty, see below) but if there's more then we could move it. I don't want to create another utils just for one method though.

@elasticsearchmachine
Copy link
Collaborator

Hi @smalyshev, I've updated the changelog YAML for you.

@smalyshev smalyshev requested a review from nik9000 December 2, 2024 19:50
@dnhatn dnhatn self-requested a review December 3, 2024 00:57
Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the ComputerService and ComputeListener. I think we're getting closer. Thanks for your iterations on this @smalyshev.

};
// Cancel the group on sink failure
ActionListener<Void> exchangeListener = computeListener.acquireAvoid().delegateResponse((inner, e) -> {
taskManager.cancelTaskAndDescendants(groupTask, "exchange sink failure", true, ActionListener.noop());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should wait for the cancellation here. I think we should also fix the same issue in ComputerListener. Can you move this into the cancellation listener?

if (suppressRemoteFailure) {
	computeListener.markAsPartial(clusterAlias, e);
	taskManager.cancelTaskAndDescendants(groupTask, "exchange sink failure", true, ActionListener.running(() -> inner.onResponse(null));
} else {
	inner.onFailure(e);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, wouldn't that code not cancel task on failure?

@dnhatn dnhatn self-requested a review December 3, 2024 06:43
assertClusterStatusAndShardCounts(remote2Cluster, EsqlExecutionInfo.Cluster.Status.SKIPPED);
}

// skip_unavailable=true clusters are unavailable, both marked as PARTIAL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test belongs here?

This is in the test for testUpdateExecutionInfoWithUnavailableClusters, which only ever sets status to SKIPPED, not PARTIAL.

This test should go in different method (not sure which though).

@smalyshev smalyshev added the auto-backport Automatically create backport pull requests when merged label Jan 2, 2025
@smalyshev
Copy link
Contributor Author

Superceded by #121240

@dnhatn dnhatn removed their request for review January 31, 2025 18:46
@smalyshev smalyshev closed this Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.19.0 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants